† Corresponding author. E-mail:
Project supported by the National Natural Science Foundation of China (Grant Nos. 61672124, 61370145, 61173183, and 61503375) and the Password Theory Project of the 13th Five-Year Plan National Cryptography Development Fund, China (Grant No. MMJJ20170203).
Detection of community structures in the complex networks is significant to understand the network structures and analyze the network properties. However, it is still a problem on how to select initial seeds as well as to determine the number of communities. In this paper, we proposed the detecting overlapping communities based on vital nodes algorithm (DOCBVA), an algorithm based on vital nodes and initial seeds to detect overlapping communities. First, through some screening method, we find the vital nodes and then the seed communities through the pretreatment of vital nodes. This process differs from most existing methods, and the speed is faster. Then the seeds will be extended. We also adopt a new parameter of attribution degree to extend the seeds and find the overlapping communities. Finally, the remaining nodes that have not been processed in the first two steps will be reprocessed. The number of communities is likely to change until the end of algorithm. The experimental results using some real-world network data and artificial network data are satisfactory and can prove the superiority of the DOCBVA algorithm.
There are many kinds of complex systems in nature and society. Most of them can be described as networks, where the nodes (or vertexes) represent entities and the edges represent the relationship between the nodes.[1–4] The community structure is the most important feature of complex networks, and it is very important to study the community structure in the network from the perspective of theoretical research and practical application. The main reasons are four points:
(i) The research community can better understand the function and the overall structure between the nodes and sides from the local network.
(ii) We can understand the dynamics of the whole network better from the perspective of mass organizations.
(iii) We can explain the underlying causes of the network from a macro perspective and help to understand the real evolution process and mechanism of the network.
(iv) It provides a clustering method for real world networks, and does not rely on other attributes of data.
It is widely believed that communities are widespread in large networks,[5,6] where the nodes have high density of inner-group edges and low density of inter-group edges.[7–11] Recently, a large number of community detection algorithms have been proposed. Early researches focused on non-overlapping communities where a node belongs to a single community in Ref. [12], such as Kernighan-Lin algorithm in Ref. [13], and the spectral bisection algorithm on the basis of the eigenvectors of the Laplacian matrix of a network in Ref. [14]. The parameters in these methods seem to be too strict to show the link of reality social networks because a node may belong to more than one community. In fact, communities in social networks usually overlap with each other since many active users can possibly participate in multiple groups at the same time.[15] For example, a student may participate in two community activities, and the proportion between the two communities is the same. Thus, more and more researchers begin to concentrate on detecting overlapping communities.
Hierarchical clustering is a common approach which detects community based on the similarity or intensity between vertices.[16] According to this approach, the community structure is considered to be hierarchical by
adding or removing edges or nodes.[16,17] At the same time, the core vertex is also the key to detect the communities.[16] In general, the nodes with high degree are likely to be vital nodes in complex networks, and they are key points in detecting communities. Through the vital nodes, we can quickly find the center of a community. The method of detecting overlapping communities based on vital nodes algorithm (DOCBVA) uses clustering method to achieve the detection of overlapping networks. Because of the screening of vital nodes, we can find the seed communities quickly and correctly. At the same time, during the research, it is shown that most setting ranges of the parameters related to the intimacy between nodes and communities are so strict that many common nodes are ignored. Therefore, we use a new parameter to measure the intimacy between nodes and communities. In addition, in many algorithms,[7,13] the number of communities is known in advance or constant once they are identified. To solve this problem, we select the main seeds of club firstly, then the number of communities is constantly updated during the next steps. By detecting the data of the real-world networks and artificial networks, our method can get the community structures quickly and accurately.
The rest of this paper is organized as follows. First, we give the crucial concepts of our new overlapping community detection algorithm. Next, we describe the detailed realization of the algorithm. Finally, we use DOCBVA to analyze real-world networks and artificial networks and show the results and conclusions. It is worth mentioning that, in this paper, we just consider the unweighted networks.
Let G = (V, E) be a graph representing a network, where V is the set of N nodes and E is the set of M connections. M is the total number of edges in the network. We denote the network community structure as C = {C1, C2,..., Ck}. ki is the degree of vertex vi,
Our objective is to find the seed communities of networks which are more distant from each other and extend the seeds. In networks, the connection crossings between communities should be less than those inside them.[13] Unlike the case of disjoint community structures, overlapping communities allow a node to belong to more than one community. In this paper, we use three functions to realize the process of community detection.
First, in order to save the running time, we will find all nodes whose degrees are greater than the average of the network, and organize them into a set S in descending order. By doing this, we can filter out some important nodes. Then we select the vi (vi ∈ S) which has the maximum degree and has not been assigned to the seeds of communities as the seed of absorbing community. Because S is ordered, vi is the first node in S. If the vertex vj meets the following condition, traverse S and absorb the vertex vj ∈ S until no more nodes can be absorbed.
Definition
In order to explain the process of selecting the core vertex and the seed of absorbing community, Zachary’s network of karate club members is considered in Fig.
However, the community structure is not always strongly related with nodes of large degree. For example, the community in power grid can be composed of nodes with homogeneous degrees. For this type of network, our approach is still applicable. For example, figure
Second, we will extend the seed communities. It is shown that most setting ranges of parameters related to the intimacy between nodes and communities are too strict in the searching process. As shown in Fig.
It reflects how tight the node μ is with community Ci.
For the unassigned nodes, we will consider the fitness function[18]
In order to quantify the community structure of a network, Newman and Girvan[19] proposed the modularity Q as a measure of network partition
The detection of the seeds of communities is the first and also the most important phase of our method. Generally speaking, the seeds of communities provide us an overview of the network as well as the critical information of communities. First, we will get the set of vital nodes S through the pretreatment of V where the degree of every node is greater than the average degree of network, which can effectively reduce the number of nodes that need to be treated. And S is an ordered collection where nodes are sorted in descending order by degree. It is worth mentioning that a node in S belongs to only one club. For example, consider Zachary’s network of karate club members in Fig.
The details are presented in Algorithm
During the experiment, we find that the condition of I(S[0]) ∩ I(u) ≥ I(u)/2 is so severe that too many nodes are divided into a club alone. So by relaxing the condition, we add another condition that u is in the same community with S[0] when they have a complete subgraph between u, S[0], and at least two common neighbor nodes of them. The experimental results show that additional condition does not have any effect on previous condition, nor does it obtain better experimental results.
As soon as the first procedure finishes, the raw network community structure can be pictured as a collection of dense parts of the network together with outliers. Because of the condition setting in step 1, there is a low overlapping score between the two seeds. Then we just need to extend the seed communities.
For Zachary’s network, because two seeds of communities {34, 33, 3, 32, 9, 24} and {1, 2, 4, 14} have been found through Algorithm
Even when the above two procedures are executed, there will still be remaining nodes due to their less attraction to the rest of the network. Therefore, we need to revisit these nodes. Here, we use the fitness function to quantify the similarity between a node u and a neighbor community C. Fitness function is described as
The detailed description is presented in Algorithm
First, we should take at least O(n logn) to sort the vertices by degree. Then in the average case, the number of nodes we choose in S is the half of V. So, the average time complexity of Algorithm
The overlapping community detecting algorithm proposed in this paper is implemented by Java programming language running on a PC with a 2.2-GHz processor, 3.0-GB memory, and Windows 7 operating system. Some artificial network data of Lancichinetti–Fortunato–Radicchi (LFR) benchmark graphs[24] and real-world networks are used to evaluate the proposed algorithm. The real-world networks include the friendship network from Zachary’s karate club study, dolphin”s associations, college football, etc. These networks can be downloaded from Ref. [25].
The well-known karate club study of Zachary[26] is widely used as a test case for new methods in complex networks. It consists of 34 members of a karate club as vertices and 78 edges representing friendships among members of the club. By chance, a dispute arose between the club’s administrator and the club’s instructor, and as a result, the club eventually split into two smaller clubs, centered on the administrator and the instructor.[20] The network and its fission are depicted in Fig.
As shown in Fig.
It describes the associations between 62 dolphins living in Doubtful Sound, New Zealand, with ties between dolphin pairs being the statistically significant frequent association.[20] This network can be split naturally into two groups. But in this paper, we find four communities by our algorithm, which are represented by different colors in Fig.
The third network instance is the college football games in division IA in USA in the fall of 2000,[8] which is a network representing the schedule of games between American college football teams. It includes 115 teams from 12 conferences, and 616 edges represent regular season games. The Chen algorithm[27] splits the network into 13 non-overlapping communities. 10 non-overlapping communities are detected by our algorithm, as shown in Fig.
In order to further verify the performance of our algorithm, we have applied it to a number of test-case networks that are commonly used for efficiency comparisons (see Table
To further verify the performance of the algorithm, we also apply the DOCBVA to the artificial networks. Figure
To measure the similarity of ground truth communities and found communities, we use normalized mutual information (NMI), an information theoretic similarity measure. In order to measure the performance of overlapping communities, we use the extensional NMI realized by Eustace et al.[29,32] The adjusted rand index (ARI)[30–32] is also used in our method. Similar to NMI, the value of ARI is also in the range between 0 and 1. Two partitions are mutually independent when ARI is close to 0 and vice versa.[33–35] The results are shown in Figs.
Table
A new DOCBVA algorithm has been proposed to detect overlapping community structures with high efficiency. The most significant improvement of our method is the decrease of the running time. Our algorithm needs to find the seed communities and then extend them. Compared with previous algorithms, our method adopts a new parameter to estimate the intimacy between nodes and communities. The absorbing parameter α is so loose that it can accurately describe the relationship of nodes and communities. We use DOCBVA to analyze real-world networks and artificial networks. From the results we can conclude that the proposed method gives good results.
[1] | |
[2] | |
[3] | |
[4] | |
[5] | |
[6] | |
[7] | |
[8] | |
[9] | |
[10] | |
[11] | |
[12] | |
[13] | |
[14] | |
[15] | |
[16] | |
[17] | |
[18] | |
[19] | |
[20] | |
[21] | |
[22] | |
[23] | |
[24] | |
[25] | |
[26] | |
[27] | |
[28] | |
[29] | |
[30] | |
[31] | |
[32] | |
[33] | |
[34] | |
[35] |